Quantitative prediction error analysis to investigate predictive performance under predictor measurement heterogeneity at model implementation

Background: When a predictor variable is measured in similar ways in the derivation and validation settings of a prognostic prediction model, yet both differ from how the predictor will be measured when the model is used in practice (i.e., "predictor measurement heterogeneity"), the performance of the model at implementation needs to be inferred. This study proposed an analysis to quantify the impact of anticipated predictor measurement heterogeneity.
Methods: A simulation study was conducted to assess the impact of predictor measurement heterogeneity across the validation and implementation settings in time-to-event outcome data. The use of the quantitative prediction error analysis was illustrated using an example of predicting the 6-year risk of developing type 2 diabetes with heterogeneity in measurement of the predictor body mass index.
Results: In the simulation study, calibration-in-the-large of prediction models was poor and overall accuracy was reduced in all scenarios of predictor measurement heterogeneity. Model discrimination decreased with increasing random predictor measurement heterogeneity.
Conclusions: Heterogeneity of predictor measurements across the validation and implementation settings reduced predictive performance at implementation of prognostic models with a time-to-event outcome. When validating a prognostic model, the targeted clinical setting needs to be considered, and analyses can be conducted to quantify the impact of anticipated predictor measurement heterogeneity on model performance at implementation.
Supplementary Information: The online version contains supplementary material available at 10.1186/s41512-022-00121-1.


Aim
We performed a simulation study to illustrate the impact of predictor measurement heterogeneity across the validation and implementation settings on the out-of-sample predictive performance of a survival model developed and validated in time-to-event outcome data, assuming all other possible sources of discrepancy in predictive performance are absent: no differences in outcome prevalence or treatment assignment policy, no overfitting with respect to the derivation data, and a prognostic model that is correctly specified in terms of functional form and included interactions. We used very large samples (n = 1,000,000) to minimize the role of random simulation error.

Time-to-event data
We simulated derivation, validation, and implementation data sets of 1,000,000 observations, each containing a continuous predictor variable X drawn from a standard normal distribution, which one can think of as a linear predictor or risk score summarizing the information of a set of predictor variables. We then simulated a time-to-event outcome, i.e., an event time T and an indicator variable Y denoting the outcome event of interest, for each subject so that outcomes followed a Cox-exponential model, using methods described by Bender and colleagues [1]. The association between X and T (the log hazard ratio) equaled log(2), and the baseline hazard equaled 0.1. We generated data sets without censoring (median survival time t = 6.5).
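The following R code gives a minimal sketch of this data-generating mechanism under the approach of Bender and colleagues; the seed and object names are illustrative and not taken from the study code.

# Simulate exponential event times under a Cox-exponential model
# (Bender et al.): T = -log(U) / (lambda * exp(beta * X)).
set.seed(42)                           # illustrative seed
n      <- 1e6                          # observations per data set
lambda <- 0.1                          # baseline hazard
beta   <- log(2)                       # log hazard ratio of X
X <- rnorm(n)                          # standard-normal predictor / risk score
U <- runif(n)
time_event <- -log(U) / (lambda * exp(beta * X))
event <- rep(1L, n)                    # no censoring: every event is observed
dat <- data.frame(X = X, time_event = time_event, event = event)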

Predictor measurement heterogeneity
Predictor measurement heterogeneity was recreated using measurement error models, similar to [2]. To distinguish different measurements of the same predictor, we denoted an exact measurement of the predictor (e.g., body weight measured on a scale) by X and a pragmatic measurement (e.g., self-reported weight) by W. Let ψ reflect the mean difference between X and W, let θ indicate the linear association between measurements X and W, and let σ²_ϵ reflect the variance introduced by random deviations in the measurement process of W, where a larger σ²_ϵ indicates that measurement W is less precise. We defined a general model of measurement heterogeneity for continuous predictors in line with the existing measurement error literature [3,4]. Assuming that the relation between X and W is linear and additive, the association between X and W can be described as

W = ψ + θX + ϵ,     (1)

where ϵ ∼ N(0, σ²_ϵ) is independent of X, T, and Y. In case of ψ = 0, θ = 1, and σ²_ϵ = 0, there is no difference between the predictor measurement procedures across the validation study and the target clinical setting, i.e., predictor measurement homogeneity, and E(W) = E(X).
We assumed W to be a surrogate measurement of X, i.e., non-differential measurement error, meaning that W carries no information about the survival time beyond that contained in X. Furthermore, we assumed ϵ to be independent of X, i.e., homoscedastic measurement error.
The derivation and validation data contained measurements of predictor X, i.e., there was predictor measurement homogeneity across the derivation and validation settings. The implementation setting contained measurements of predictor W, i.e., there was predictor measurement heterogeneity across the validation and implementation settings. The parameters of measurement error model (1) were varied to recreate 27 scenarios (3 × 3 × 3) of predictor measurement heterogeneity; an illustrative sketch of generating W follows.
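As a minimal sketch, the pragmatic measurement W can be generated from model (1) as follows; the parameter values shown are examples only, not the exact 3 × 3 × 3 grid used in the study (X and n are assumed from the data-generation sketch above).

# Generate the pragmatic measurement W under measurement error model (1):
# W = psi + theta * X + epsilon, with epsilon ~ N(0, sigma_eps^2).
psi       <- 0.5    # additive systematic heterogeneity (example value)
theta     <- 1.2    # multiplicative systematic heterogeneity (example value)
sigma_eps <- 0.3    # SD of random heterogeneity (example value)
W <- psi + theta * X + rnorm(n, mean = 0, sd = sigma_eps)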

Prediction target
The prediction target was defined as obtaining correct predictions of the outcome risk at time point t = 6.5, conditional on predictor measurement W measured at the moment of prediction (i.e., at t = 0).

Methods
Using the derivation data set, two survival models were fitted: a parametric exponential survival model and a semi-parametric Cox regression model. Although a prediction model is typically internally validated before performing external validation [5,6], we did not perform an internal validation since issues of overfitting were expected to be negligible in a sample of 1,000,000 observations. The prediction model was externally validated in a validation data set at time t = 6.5 (corresponding to the median survival time) under predictor measurement homogeneity. Furthermore, the prediction model was externally validated in various clinical implementation settings under predictor measurement heterogeneity.
Validating a parametric exponential survival model and a Cox model in data under all three censoring mechanisms for all 27 scenarios of predictor measurement heterogeneity resulted in 162 scenarios (2 × 3 × 27). A sketch of the model-fitting and prediction steps follows.
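A minimal sketch of fitting both models on the derivation data and obtaining predicted survival probabilities at t = 6.5, assuming the data frame dat from the data-generation sketch above; data_val and the other object names are illustrative assumptions.

# Fit the two survival models on the derivation data.
library(survival)
library(rms)
fit_cox <- coxph(Surv(time_event, event) ~ X, data = dat, x = TRUE)
fit_exp <- psm(Surv(time_event, event) ~ X, data = dat, dist = "exponential")

# Predicted survival probabilities at t = 6.5 in a validation set data_val
# (assumed to have the same structure as dat), via the pec package.
library(pec)
t_val <- 6.5
pred_surv <- predictSurvProb(fit_cox, newdata = data_val, times = t_val)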

Performance metrics
Predictive performance at t = 6.5 was evaluated in terms of calibration, discrimination, and overall accuracy. Calibration of the model on average, or 'calibration-in-the-large' [7,8], was evaluated by the ratio of the observed marginal survival at t = 6.5 (obtained from a Kaplan-Meier curve) to the predicted marginal survival at t = 6.5 (obtained by averaging each observation's predicted survival at t = 6.5), denoted the observed/expected ratio (O/E ratio).
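A minimal sketch of the O/E ratio, assuming the validation data data_val and the predicted survival probabilities pred_surv from the sketch above:

# O/E ratio at t = 6.5: observed marginal survival (Kaplan-Meier) divided
# by the expected (mean predicted) marginal survival.
library(survival)
km       <- survfit(Surv(time_event, event) ~ 1, data = data_val)
obs_surv <- summary(km, times = t_val)$surv
exp_surv <- mean(pred_surv)
oe_ratio <- obs_surv / exp_surv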
Discrimination was evaluated using the time-dependent cumulative c-statistic, AUC(t), at t = 6.5, estimated with the timeROC package:

# R code corresponding to file ./R/analysis.R
# Evaluate the time-dependent cumulative c-statistic using the timeROC package.
c_stat <- timeROC::timeROC(
  T      = data_val$time_event,
  delta  = data_val$event,
  cause  = 1,
  marker = lp,                 # linear predictor as above
  times  = t_val)$AUC[2]

Overall accuracy was evaluated by the index of prediction accuracy at t = 6.5, IPA(t), which equals the Brier score [12] at t = 6.5 benchmarked against a null model that ignores all patient-specific information and simply predicts the empirical prevalence for each patient [13]. A perfect model has an IPA of 100%, a non-informative model has an IPA of 0%, and a negative IPA indicates a harmful model.
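As a hedged sketch, the IPA at t = 6.5 can be computed with the riskRegression package, assuming fit_cox, data_val, and t_val from the sketches above:

# Index of prediction accuracy (IPA) at t = 6.5 via riskRegression::Score.
library(survival)
library(riskRegression)
score <- Score(list("cox" = fit_cox),
               formula = Surv(time_event, event) ~ 1,  # null-model benchmark
               data    = data_val,
               times   = t_val,
               summary = "ipa")
score$Brier$score   # Brier score and IPA per model and time point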

Software
The simulation study was performed using R statistical software version 3.6.3 [14]. The simulation code is available from https://github.com/KLuijken/PMH_Survival and is structured according to the targets package [15]. The most important dependencies are the survival package for survival functionality and fitting the Cox regression model [16], the rms package for fitting the parametric survival model [17], the pec package for predicting survival risks [18], the timeROC package [19] for estimating the AUC(t), and the riskRegression package [20] for estimating the IPA. The simulation design was described according to Morris and colleagues [21].

Results of simulation study
In addition to the results presented in the main text, we present descriptive results to facilitate replication of the simulation study. Validation of the parametric exponential survival model and of the semi-parametric Cox model in the validation data, i.e., under predictor measurement homogeneity, yielded the following results. For both models, across the three censoring scenarios, the calibration-in-the-large coefficient (a measure of weak calibration) equaled 1, indicating good calibration; the AUC(t = 6.5) was 0.74, indicating discriminatory ability similar to derivation; and the IPA(t = 6.5) was 0.17, indicating accuracy similar to derivation.

External predictive performance under predictor measurement heterogeneity
When measurement procedure W contained more random variability than X, i.e., random measurement heterogeneity, σ²_ϵ > 0, the O/E ratio moved slightly below 1 (Figure 1A). The AUC(t = 6.5) and IPA(t = 6.5) decreased as random measurement heterogeneity increased.
Additive systematic measurement heterogeneity, i.e., ψ ≠ 0, affected the calibration-in-the-large coefficient at implementation, but minimally affected the AUC(t = 6.5) and IPA(t = 6.5) at implementation (Figure 1B). When measurement procedure W at implementation yielded systematically higher values of the predictor than measurement procedure X at validation, i.e., ψ > 0, this resulted in overestimation of the average outcome incidence at implementation and an O/E ratio below 1.
Multiplicative systematic measurement heterogeneity, i.e., θ ≠ 1, yielded a negative calibration-in-the-large coefficient when θ > 1 (Figure 1C). Multiplicative systematic measurement heterogeneity minimally affected the AUC(t = 6.5) in the absence of additive systematic and random measurement heterogeneity. As θ moved further from 1, the IPA(t = 6.5) at implementation decreased, indicating lower overall accuracy.

Detailed results
Measures of predictive performance in all scenarios are presented to illustrate that combined random, additive systematic, and/or multiplicative systematic predictor measurement heterogeneity sometimes reinforced or cancelled out effects on predictive performance. We additionally present descriptive statistics of the simulated implementation data sets to facilitate replication of the findings.